Abstract. A framework for learning optimal dictionaries for simultaneous
sparse signal representation and robust classification is introduced in
this paper. The dictionary learning problem is solved by a class-dependent
supervised simultaneous orthogonal matching pursuit, which learns the
intra-class structure while increasing the inter-class discrimination,
interleaved with an efficient dictionary update obtained via singular value
decomposition. This framework addresses for the first time the explicit
incorporation of both reconstruction and discrimination terms in the
non-parametric dictionary learning and sparse coding energy. The work
contributes to the understanding of the importance of learned sparse
representations for signal classification, showing the relevance of learning
dictionaries that are discriminative and at the same time reconstructive in
order to achieve accurate and robust classification. The presentation of the
underlying theory is complemented with examples on the standard MNIST and
Caltech datasets, and results on the use of the sparse representation
obtained from the learned dictionaries as local patch descriptors, replacing
commonly used experimental ones.

1  Introduction 

The study of sparse representations has become a major field of research in
signal processing. Efforts have been focused mainly on the development of
theoretical frameworks (e.g., [2, 5]), algorithms to efficiently perform
sparse coding (e.g., [4, 8, 21]), learning of overcomplete sets of vectors
denoted as dictionaries (e.g., [1, 23, 28]), and applications in image
processing (e.g., [6, 20, 28]). Sparse representations over non-parametric
learned dictionaries lead to state-of-the-art results for image enhancement
[20].

Since they are originally trained to contain sufficient information for
reconstruction, sparse representations are, from the point of view of signal
classification, a reconstructive approach. This provides representations
that are relatively robust against distortions and missing data.
Discriminative methods, on the other hand, take the classification
performance itself as their criterion, an element that has not yet been
significantly addressed in the sparsity (non-parametric) dictionary learning
community and that constitutes one of our key contributions in this work.
Discriminative methods often outperform reconstructive ones in ideal
conditions, but lack robustness.

Our proposed framework introduces a novel metric which includes both
reconstruction and discrimination terms in the dictionary learning process,
benefiting from the best of both the discriminative and reconstructive
worlds. This is incorporated into a new energy, inspired by the framework
put forward in [1, 20], leading to the learning of adapted dictionaries and,
with them, sparse image representations that are both discriminative and
reconstructive. These learned dictionaries provide robust discriminant
representations through adaptation to the dataset. Our proposed framework is
based on the concept of obtaining simultaneous sparse decompositions within
each class, so as to extract its internal structure, while keeping a global
discrimination term among different classes. Such explicit and efficient
incorporation of the task (classification) into the dictionary learning for
sparse coding is unique and a key novelty of the proposed work.

1.1  Related  Work  and  Our  Contribution 

Huang and Aviyente [11] proposed the marriage of discrimination and
reconstruction in sparse image representations, introducing a novel
discrimination term into the classical reconstructive energy formulation of
sparse coding. Their approach proved to yield robust and discriminant image
representations through an intrinsic dimensionality reduction. In contrast
with our proposed framework, there is no dictionary learning in [11]; they
use pre-defined dictionaries and sparse coding over them. As shown below,
and as further supported by the image processing literature on
non-parametric dictionary learning, adapted learned dictionaries outperform
off-the-shelf ones. Effrosyni and Frossard introduced a similar algorithm
[14], which they named Supervised Simultaneous Orthogonal Matching Pursuit
or SSOMP. Simultaneous sparse decompositions, which are applied to the whole
dataset/class at once, prove to be essential in order to extract the
structure of a class and to help capture its intrinsic variability.

LeCun et al. introduced an algorithm for learning sparse representations,
based on an energy model, through a linear coder and a linear decoder
[26, 27]. This is based on a (coordinate) sparsifying logic quite different
from our principle of sparse coding. The work includes a complex neural
network, with multiple layers and training steps. Since the dictionary
design and the neural network training are based on different criteria, it
is not clear how much of the outstanding performance is due to each part. No
results are given concerning robustness, which are needed to verify whether
the properties of sparse coding have been inherited. Our objective is
learning representations that by themselves are discriminative and robust,
leading both to different energy formulations and optimization techniques.
We include explicit discriminative terms, which are absent in the framework
in [26, 27], in addition to the reconstructive one.

Lazebnik and Raginsky recently introduced an elegant dictionary learning
algorithm based on Information Loss Minimization [16]. It learns a codebook
with the objective of obtaining a quantization that does not cause high
distortion and at the same time keeps nearly all the information about the
class of the original signal. There are no explicit discrimination
constraints, although classification is the main goal. Leibe and
collaborators, e.g., [7, 17], have proposed in a series of leading works the
use of learned dictionaries obtained from clustering of image patches,
without explicit sparsity, reconstruction, and/or discrimination
requirements. Sparsity has also been recently incorporated into a very
interesting robust face recognition framework by Ma et al. [32], motivated
by the work on compressed sensing and random projections. This work neither
explicitly enforces reconstruction and/or discrimination nor learns adapted
dictionaries. In [9] and companion papers, the authors develop an efficient
$\ell_1$-based optimization approach for learning dictionaries for sparse
representation, much in the spirit of [1], again with an energy tuned to
reconstruction only, and use the coefficients of the overcomplete
representation for classification. The minimal reconstruction error from
multiple generative-only dictionaries, each one independently learned for a
different class, is used in [25] for classifying textures. Finally, we
should mention that the work on epitomes [12] provides a different
generative-only model for learning dictionaries that can also be used for
recognition [15].

Contributions: In contrast with previous approaches, the framework proposed
here learns a non-parametric dictionary which is efficient for sparsely
representing a signal and at the same time performing class discrimination.
Such a dictionary and sparse representation are derived from the efficient
minimization of an energy that explicitly includes these critical
components. This novel classification framework can be seen as a deviation
and step forward from approaches that either use off-the-shelf dictionaries
and features (e.g., [11, 18, 22, 31]), or learn dictionaries without
explicit discrimination and/or reconstruction goals (meaning they obtain or
learn dictionaries with criteria that often do not explicitly include the
actual application and performance criteria). Although such alternative
approaches have performed outstandingly, the studies of their actual
optimality, performance, and limitations have been purely experimental. The
underlying idea behind our proposed framework is to start gearing toward
feature detectors and non-parametric dictionaries that are designed and
optimized for the task at hand.

2  Learning  Dictionaries  for  Representation  and 
Discrimination 

We  now  build,  step-by-step,  the  proposed  framework  for  learning  discriminative 
and  representative  non-parametric  dictionaries. 


2.1  Supervised  Sparse  Coding 

For sparse coding, we extend the Simultaneous Orthogonal Matching Pursuit
(SOMP) algorithm by adding discrimination power; see [24, 29, 30] for
details and theoretical results on this greedy technique. Given a dictionary
matrix $D \in \mathbb{R}^{n \times K}$ (which we will later learn), that
contains $K$ atoms $\{d_i\}_{i=1}^{K} \subset \mathbb{R}^n$ ($K > n$), and a
set of signals $\{x_i\} \subset \mathbb{R}^n$, SOMP attempts to represent
these signals at once as a linear combination of a common subset of atoms of
cardinality much smaller than $n$ (sparse representation). Under the
assumption that those signals belong to a certain class, SOMP attempts to
extract their common internal structure. By keeping the sparsity low enough,
the internal variation of the class can be eliminated, leading to more
accurate classification while being robust to noise. After adding
classification terms into SOMP (see next), we will explicitly use these
coefficients of sparse representation, over a discriminative learned
dictionary, for classification.

To further increase the inherent discriminant capacity of SOMP, we will next
incorporate into SOMP a discrimination measure inspired by linear
discriminant analysis (LDA) (on top of the original reconstruction
component, see also [11, 14]): the quotient of the $\ell_2$ norms of the
corresponding scatter matrices. Given $c$ sets of vectors $\{a_i^j\}_i$,
each one representing one class, with $j \in \{1, \dots, c\}$, we propose as
linear discrimination measure

$$\mathcal{J}(\{a_i^j\}) := \frac{\mathrm{trace}(S_B)}{\mathrm{trace}(S_W) + \mu},$$

where $S_B$ and $S_W$ are the standard between-class and within-class
scatter matrices and $\mu$ is a regularization parameter.¹
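As an illustration, the measure can be evaluated without forming any scatter matrix, since the trace of a scatter matrix is a sum of squared distances. The following is a minimal sketch, assuming NumPy, one array per class with one vector per row, and the common LDA convention in which each class mean is weighted by the class size in the between-class scatter; the function name and the default value of `mu` are hypothetical.

```python
import numpy as np

def discrimination_measure(classes, mu=1e-3):
    """Return trace(S_B) / (trace(S_W) + mu).

    classes: list of arrays of shape (n_j, d), one row per vector.
    Assumption: S_B weights each class mean by the class size n_j.
    """
    global_mean = np.vstack(classes).mean(axis=0)
    trace_sb = 0.0
    trace_sw = 0.0
    for a in classes:
        class_mean = a.mean(axis=0)
        # trace of the between-class scatter: n_j * ||m_j - m||^2
        trace_sb += a.shape[0] * np.sum((class_mean - global_mean) ** 2)
        # trace of the within-class scatter: sum_i ||a_i - m_j||^2
        trace_sw += np.sum((a - class_mean) ** 2)
    return trace_sb / (trace_sw + mu)
```

Well-separated, compact classes give a large value; the regularization `mu` keeps the ratio finite when a class has no internal variation.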

Following the introduction of the discriminative measure
$\mathcal{J}(\cdot)$, the originally purely reconstructive objective of SOMP
is modified by incorporating this discrimination measure over the sparse
representation coefficients $\{\alpha_i^j\}$ corresponding to each one of
the $c$ classes. For a dictionary $\{d_i\}_{i=1}^{K}$ and a set of indices
$\Lambda$, let $\Phi_\Lambda \in \mathbb{R}^{n \times |\Lambda|}$ be the
matrix whose columns are the $d_i$, $i \in \Lambda$. Let
$\{x_i^j\}_{i=1}^{n_j}$ be the signals from the $j$-th class,
$j \in \{1, \dots, c\}$ (e.g., images in column representation), and $X$ the
matrix with columns $x_i^j$. We state the Simultaneous Sparse Discriminant
Problem as

$$\arg\max_{|\Lambda| \le L} \; \theta \cdot \mathcal{J}(\{\alpha_i^j\}) - \|X - \Pi_\Lambda X\|_F^2. \qquad (1)$$

Here, $L$ is the sparsity factor indicating how many atoms are used to
represent the signals, $\Pi_\Lambda$ is simply the orthogonal projection of
the signals onto the span of the selected set $\Phi_\Lambda$, and $\theta$
is a parameter that controls the trade-off between the discriminative term
(first component of (1)) and the reconstruction term (second component of
(1), the only one present in the original SOMP; $\|\cdot\|_F$ stands for the
Frobenius norm). $\theta$ is dynamically updated (see also [14]).² We
propose a greedy approach to address this optimization, denoted as
Supervised SOMP (SSOMP), see Figure 1 (in the following we omit the dynamic
dependency of $\theta$ in the notation).

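Considering that the sparsity coefficients are
$\alpha_i^j = (\Phi_\Lambda^T \Phi_\Lambda)^{-1} \Phi_\Lambda^T x_i^j$, the
corresponding scatter matrices of $\{\alpha_i^j\}$ and $\{x_i^j\}$ verify
$S_A(\alpha) = (\Phi_\Lambda^T \Phi_\Lambda)^{-1} \Phi_\Lambda^T S_A(x) \Phi_\Lambda (\Phi_\Lambda^T \Phi_\Lambda)^{-1}$,
where $A \in \{B, W\}$. Under the assumption that the degree of correlation
is low, $\Phi_\Lambda^T \Phi_\Lambda$ can be approximated by the identity
and thus we decompose the contribution of each vector,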

¹ Ideally, we would like to use the product of the positive eigenvalues of
those matrices. Since the determinant is zero ($c < n$), it is not possible
to use it. We therefore chose the summation (the trace) to yield our
discrimination term.

2  ffd)  « —  g  .  j_  Yliii  I  <  >  I,  with  being  the  previous 

residual,  Figure  1. 
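Input: dictionary $D \in \mathbb{R}^{n \times K}$, signals $\{x_i^j\}$, sparsity level $L$.

Convention: $\Phi_0$ is an empty matrix.

Output: reconstructions $\{\hat{x}_i^j\}$, sparsity coefficients $\{\alpha_i^j\}$.

SSOMP:

1. Initialize the residuals $r_i^{j,(0)} = x_i^j$, $\Lambda_0 = \emptyset$, $t = 1$.

2. Find the index $\lambda_t$ (break ties deterministically when needed) that solves (a)

   $$\lambda_t = \arg\max_{p} \; \sum_{j} \sum_{i} |\langle r_i^{j,(t-1)}, d_p \rangle| + \theta \cdot \mathcal{J}(\{\langle r_i^{j,(t-1)}, d_p \rangle\}).$$

3. Update sets $\Lambda_t = \Lambda_{t-1} \cup \{\lambda_t\}$, $\Phi_t = [\Phi_{t-1}, d_{\lambda_t}]$.

4. Compute new coefficients (sparse representations), reconstructions, and residuals

   $$\alpha_i^{j,(t)} = \arg\min_{\alpha} \|x_i^j - \Phi_t \alpha\|_2 = (\Phi_t^T \Phi_t)^{-1} \Phi_t^T x_i^j, \quad \hat{x}_i^{j,(t)} = \Phi_t \alpha_i^{j,(t)}, \quad r_i^{j,(t)} = x_i^j - \hat{x}_i^{j,(t)}.$$

5. $t = t + 1$. If $t \le L$ go back to 2.

6. Return the estimates $\{\hat{x}_i^{j,(L)}\}$, the coefficients $\{\alpha_i^{j,(L)}\}$, and the indices $\Lambda_L$.

(a) Next best coefficient/atom to simultaneously provide good reconstruction and
good discrimination.

Fig. 1. Supervised SOMP (SSOMP).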

$\mathrm{trace}(S_A(\alpha)) \approx \sum_{i=1}^{L} d_{\lambda_i}^T S_A(x) d_{\lambda_i}$,
where, as in Figure 1, $\Lambda = \{\lambda_1, \dots, \lambda_L\}$. This
quantity can be greedily calculated. Furthermore, to yield a better
estimate, we evaluate each one of the summation terms in this expression
over the residuals, so that the non-orthogonality is better taken into
account. This is equivalent to the way that classical OMP treats
correlations in the orthogonal projection to evaluate the reconstruction
error. It is also equivalent to applying a reduction to one dimension over
the residual (each term $d^T S_A(x) d$ is the scatter of the scalar
projections $\langle x_i^j, d \rangle$), so that we can directly use
$\mathcal{J}$, as stated in the algorithm. Finally, note that there is no
need to build any matrix explicitly, since each term can be directly
evaluated.
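The greedy selection of Figure 1 can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes NumPy, unit-norm atoms, a fixed `theta` (the dynamic update of $\theta$ is not reproduced), class-size weighting in the between-class scatter, and $L \ge 1$. All function and parameter names are hypothetical.

```python
import numpy as np

def _trace_ratio(values, labels, mu):
    """trace(S_B) / (trace(S_W) + mu) for scalar projections grouped by label."""
    global_mean = values.mean()
    s_b = s_w = 0.0
    for c in np.unique(labels):
        v = values[labels == c]
        s_b += v.size * (v.mean() - global_mean) ** 2
        s_w += np.sum((v - v.mean()) ** 2)
    return s_b / (s_w + mu)

def ssomp(D, X, labels, L, theta=1.0, mu=1e-3):
    """Sketch of supervised simultaneous OMP.

    D: (n, K) dictionary, X: (n, m) signals as columns, labels: (m,) classes.
    Returns the list of selected atom indices and the (L, m) coefficients.
    """
    D = np.asarray(D, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    residual = X.copy()
    selected = []
    coeffs = None
    for _ in range(L):
        corr = D.T @ residual                    # inner products <r_i, d_p>
        score = np.abs(corr).sum(axis=1)         # reconstruction term
        if theta != 0.0:                         # discrimination term on the residuals
            score = score + theta * np.array(
                [_trace_ratio(row, labels, mu) for row in corr])
        if selected:
            score[selected] = -np.inf            # never select an atom twice
        selected.append(int(np.argmax(score)))   # ties go to the lowest index
        Phi = D[:, selected]
        coeffs, *_ = np.linalg.lstsq(Phi, X, rcond=None)  # orthogonal projection
        residual = X - Phi @ coeffs
    return selected, coeffs
```

With `theta=0.0` the sketch reduces to plain SOMP, which is a convenient sanity check before turning on the discrimination term.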


2.2  Class  Supervised  SOMP 

The SOMP extracts the common coherent internal structure of a given class.
However, when dealing with signals from different classes, this coherence
no longer exists. Thus the number of atoms needed to give a proper
characterization of all the classes with SOMP is larger than that needed for
each class individually. As a consequence, the representation captures more
of the intra-class variance, decreasing the classification performance.
Performing one independent SOMP per class and joining the sets of selected
atoms could be even worse, since there may well be a minimum common
structure among the classes and redundancies could arise, giving rise to
problems such as multiple representations. This fact is critical when the
atoms have been trained for reconstruction tasks, since a small number of
them can describe the signals highly accurately.

In order to achieve the goal of seeking internal structure within each class
and at the same time global discrimination among the classes, we propose the
Class Supervised SOMP (CSSOMP) algorithm in Figure 2. The reconstruction
term is treated class by class, whereas the discrimination term is always
global.

Now that we know how to sparsely encode signals to simultaneously achieve
discrimination and reconstruction (and thereby robustness), it is time to
optimize the dictionary to the data and task, bringing an additional novelty
to the framework.


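Input: dictionary $D \in \mathbb{R}^{n \times K}$, signals $\{\{x_i^j\}_{i=1}^{n_j}\}_{j=1}^{c}$ such that $\forall i, j$ $x_i^j \in \mathbb{R}^n$, and
sparsity level $L$.

Output: $\{\alpha_i^j\}$, selected set $P_c$.

CSSOMP:

1. Initialize the class counter $q = 1$ and $P_0 = \emptyset$.

2. Selection of $L$ vectors according to the structure of the class $q$.

   1.q. Initialize the residuals $r_i^{j,(0)} = x_i^j$, the index set $\Lambda_0^q = \emptyset$, and the iteration counter $t = 1$.

   2.q. Find the index $\lambda_t^q$ that solves (break ties deterministically when needed)

   $$\lambda_t^q = \arg\max_{p} \; \underbrace{\sum_{i} |\langle r_i^{q,(t-1)}, d_p \rangle|}_{\text{Class-}q} + \theta_t \cdot \underbrace{\mathcal{J}(\{\langle r_i^{j,(t-1)}, d_p \rangle\})}_{\text{Global}}.$$

   3.q. $\Lambda_t^q = \Lambda_{t-1}^q \cup \{\lambda_t^q\}$, $\Phi_t^q = [\Phi_{t-1}^q, d_{\lambda_t^q}]$.

   4.q. Compute new sparsity coefficients, reconstructions, and residuals

   $$\alpha_i^{j,(t)} = ((\Phi_t^q)^T \Phi_t^q)^{-1} (\Phi_t^q)^T x_i^j, \quad \hat{x}_i^{j,(t)} = \Phi_t^q \alpha_i^{j,(t)}, \quad r_i^{j,(t)} = x_i^j - \hat{x}_i^{j,(t)}.$$

   5.q. $t = t + 1$. If $t \le L$ go back to 2.q.

   6.q. For the class $q$ the estimates are $\hat{x}_i^{q,(L)}$ and its coefficients of sparse representation are $\alpha_i^{q,(L)}$.

3. Save only $\hat{x}_i^{q,(L)}$ and $\alpha_i^{q,(L)}$, $\forall i \in \{1, \dots, n_q\}$.

4. $P_q = P_{q-1} \cup \Lambda_L^q$, $q = q + 1$. If $q \le c$ go back to 2.

5. Return $\{\alpha_i^j\}$ and $P_c$.

Fig. 2. Class Supervised SOMP (CSSOMP).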
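One possible reading of Figure 2 is sketched below: for each class, atoms are selected greedily with a reconstruction score computed on that class only and a discrimination score computed on the residuals of all classes, and the per-class selections are then merged. This is an interpretation under stated assumptions (NumPy, fixed `theta`, class-size weighting in the between-class scatter, $L \ge 1$, all signals projected onto the atoms of the current class to evaluate the global term); the names are hypothetical.

```python
import numpy as np

def trace_ratio(values, labels, mu):
    """trace(S_B) / (trace(S_W) + mu) for scalar values grouped by label."""
    global_mean = values.mean()
    s_b = s_w = 0.0
    for c in np.unique(labels):
        v = values[labels == c]
        s_b += v.size * (v.mean() - global_mean) ** 2
        s_w += np.sum((v - v.mean()) ** 2)
    return s_b / (s_w + mu)

def cssomp(D, X, labels, L, theta=1.0, mu=1e-3):
    """Sketch of class supervised SOMP.

    Returns (union of selected atoms, {class: (atoms, class coefficients)}).
    """
    D = np.asarray(D, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    union = []
    per_class = {}
    for q in np.unique(labels):
        in_q = labels == q
        residual = X.copy()          # residuals of all classes feed the global term
        selected = []
        for _ in range(L):
            corr = D.T @ residual
            score = np.abs(corr[:, in_q]).sum(axis=1)       # class-q reconstruction
            score += theta * np.array(
                [trace_ratio(row, labels, mu) for row in corr])  # global discrimination
            if selected:
                score[selected] = -np.inf
            selected.append(int(np.argmax(score)))
            Phi = D[:, selected]
            coeffs, *_ = np.linalg.lstsq(Phi, X, rcond=None)
            residual = X - Phi @ coeffs
        per_class[q.item()] = (selected, coeffs[:, in_q])   # keep class-q results only
        union += [p for p in selected if p not in union]
    return union, per_class
```

Because atoms may be shared between classes, the merged set can be smaller than $c \cdot L$.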


2.3  Learning  the  Dictionary:  The  Complete  Model 

In order to learn dictionaries that are also discriminant, we define the
Sparse Discriminant Dictionary Problem:

$$\max_{D, \alpha} \; \theta \cdot \mathcal{J}(\{\alpha_i^j\}) - \sum_{j=1}^{c} \sum_{i=1}^{n_j} \|x_i^j - D \alpha_i^j\|_2^2, \qquad (2)$$

subject to $\|\alpha_i^j\|_0 \le L$, $\forall i, j$. In contrast with the
energies in previous sections, the optimization is both over the dictionary
$D$ and the sparse representation over it, $\alpha$. If $\theta = 0$, we
obtain the reconstruction-only formulation in [1].

To address this optimization problem, we extend the K-SVD, an algorithm for
learning overcomplete non-parametric dictionaries for sparse representation
[1]. Its objective is to design a dictionary such that the reconstruction
error over a set of signals, when coded sparsely, is minimal. This is
achieved through an iterative process which alternates a Sparse Coding
stage, which follows the classical OMP, and a Dictionary Update stage
derived from a simple SVD (each atom is updated to improve the
reconstruction of those signals that use it). Our proposed algorithm
modifies the Sparse Coding stage, which is now performed by CSSOMP instead
of OMP, adding the discrimination component and yielding the Supervised
K-SVD (SKSVD), Figure 3.


Input: initial dictionary matrix $D^{(0)} \in \mathbb{R}^{n \times K}$, signals $\{\{x_i^j\}_{i=1}^{n_j}\}_{j=1}^{c}$, sparsity
level $L$.

Output: trained dictionary $D$ and sparse representation $\alpha$.

SKSVD: Set $J = 1$. Repeat until convergence:

- Sparse coding stage: Use CSSOMP to compute the sparse representation
  coefficient vectors $\alpha_i^j$ for each signal $x_i^j$.

- Dictionary Update Stage: For each column $k = 1, \dots, K$ in $D^{(J-1)}$, update it by

  - Define the group of examples that use this atom,
    $\omega_k := \{i \mid \alpha_T^k(i) \neq 0\}$, where $\alpha_T^k$ is the
    $k$-th row of the coefficient matrix.

  - Compute the overall representation error matrix, $E_k := X - \sum_{j \neq k} d_j \alpha_T^j$.

  - Restrict $E_k$ by selecting only the columns corresponding to $\omega_k$ and obtain $E_k^R$.

  - Apply the SVD decomposition $E_k^R = U \Delta V^T$. Select the updated
    dictionary column $d_k$ to be the first column of $U$. Update the
    coefficient vector $\alpha_R^k$ to be the first column of $V$
    multiplied by $\Delta(1,1)$. (a)

- Set $J = J + 1$.

(a) This step minimizes the reconstruction error for the group of signals
corresponding to $\omega_k$.

Fig. 3. Supervised K-SVD (SKSVD).
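The dictionary update stage of Figure 3 for a single atom can be sketched as below. This is a minimal illustration assuming NumPy, a coefficient matrix `A` of shape (K, m) whose column `i` codes signal `i`, and floating-point arrays that are updated in place; the function name is hypothetical, and atoms that no signal uses are simply left unchanged in this sketch.

```python
import numpy as np

def update_atom(D, A, X, k):
    """One K-SVD style update of atom k.

    D: (n, K) dictionary, A: (K, m) coefficients, X: (n, m) signals.
    D and A are modified in place and returned.
    """
    users = np.flatnonzero(A[k] != 0)        # omega_k: signals that use atom k
    if users.size == 0:
        return D, A                          # unused atom: nothing to update here
    # Restricted error matrix without the contribution of atom k.
    E = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                        # new unit-norm atom
    A[k, users] = s[0] * Vt[0]               # new coefficients on the same support
    return D, A
```

Since only the entries on the existing support are rewritten, the sparsity pattern produced by the coding stage is preserved.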

The proposed CSSOMP makes it possible to maximize the energy with respect to
all of the $\alpha_i^j$, keeping in mind the global aspect of the
discrimination term. At the same time we incorporate the prior of a coherent
structure in a class. This is obtained not only through the simultaneous
decomposition of each class, but also from the transmission of information
between the coding and dictionary update stages. All signals from the same
class use the same atoms, and thus the Dictionary Update Stage accounts for
the general internal structure of the class (especially if the sparsity is
kept small compared to the dimension).³

³ Explicitly incorporating an additional discrimination term into the
dictionary update step is the subject of parallel efforts [19].

Since the proposed framework permits the use of common atoms by multiple
classes, the inner common structures of the whole ensemble of atoms will be
learned by those atoms used by multiple classes. Finally, compared to
selecting a set of atoms simply through CSSOMP, the learning adapts to any
desired number of atoms, which is fixed beforehand, so that during the
learning stage we force the use of atoms by multiple classes. We have
thereby eliminated the problem of divergence of supports.


3  Experimental  Results 

Experimental analysis to demonstrate the importance of learning
discriminative and representative (non-parametric) dictionaries has first
been carried out with the standard MNIST Handwritten Digit Database, using
$n = 16 \times 16 = 256$ dimensional vectors. This is done to demonstrate
the importance of the proposed framework, and in particular of learning
non-parametric dictionaries that reconstruct and discriminate. We then show
results on natural images.

The classification tasks have been performed using linear SVMs on the sparse
representation coefficients (see also [9]), consistent with our criteria. In
particular, we have used the implementation in [3] and a multiclass
one-against-one strategy [10]. Following [13], and given the large amount of
data, parameters and accuracies have been estimated through 10-fold
cross-validation. SVMs are trained in noiseless conditions and tested over
corrupted data, so that the robustness comes from the representation
itself.⁴

The particular dictionaries used in the test are: (1) the union of DCT and
Haar bases (511 total atoms), which are very well adapted to this database;
(2) a KSVD learned dictionary of size 1031 and sparsity $L = 4$ (no
discrimination components); and (3) the introduced SKSVD, learned for
dimension (total number of atoms) 50 and sparsity $L = 15$, yielding an
under-complete system. Dictionaries are learned with 20000 images and the
different algorithms are tested with about 9000 images.

The Need for Simultaneous Decompositions. The first idea one could have is
to directly use the coefficients of a sparse decomposition through OMP for
classification, under the hypothesis that different classes will have
different supports. Unfortunately, this does not hold. First of all, a
sparse representation over overcomplete dictionaries has the problem of
multiple representations. Through the SOMP we reduce the size of the
dictionaries (to 50), thereby addressing this first problem. Even then, for
all of the dictionaries, the distribution of Hamming distances among the
supports of the $\alpha$'s is very similar within one class and between
classes. Even worse, the average distance is close to the maximum possible.
Learning the dictionary with KSVD, and in particular with the proposed
SKSVD, yields

⁴ There is a link between SVM feature selection techniques and our
discrimination term. This could be interpreted as incorporating the F-score
criterion itself in the design of the dictionary. However, those techniques
do not take into account possible correlations between variables or
reconstruction properties, since SVMs are a discriminative approach.


Sparse  Representations  for  Image  Classification 




significantly  better  results  than  the  fixed  dictionary  one,  but  this  is  still  not 
sufficient  for  classification. 
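The comparison of supports discussed above can be made concrete with a small sketch. It assumes that each support is given as a set of selected atom indices and that the Hamming distance counts the atoms used by exactly one of the two signals, so the maximum for two $L$-sparse codes is $2L$; the function names are hypothetical.

```python
from itertools import combinations

def support_hamming(support_a, support_b):
    """Number of atoms used by exactly one of the two supports."""
    return len(set(support_a) ^ set(support_b))

def mean_support_distance(supports):
    """Average Hamming distance over all pairs of supports."""
    pairs = list(combinations(supports, 2))
    return sum(support_hamming(a, b) for a, b in pairs) / len(pairs)
```

Computing this average once over pairs from the same class and once over pairs from different classes gives the two distributions being compared; supports are useful for classification only if the first is clearly smaller than the second.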

The underlying problems are the non-ideality of OMP and the
over-completeness of our dictionaries. Consider the images for the class
corresponding to the number "1." There may well be 15 atoms of the KSVD
dictionary that describe the ensemble accurately and show the internal
coherence within the class. However, when one performs an OMP decomposition
for a single number "1," the over-completeness of the dictionaries and their
good reconstruction capacity may well drive the greedy selection towards
different atoms tailored specifically for it. This is illustrated in Figure
4 for the DCT+Haar dictionary. In the case of OMP, after 5 atoms, the
algorithm selects highly localized Haar atoms in order to describe small
details. The reconstructions are not natural and focus on certain areas of
the images. Details are important for reconstruction, but not necessarily
for classification, since they are often associated with the intra-class
variation. In contrast, in the case of SOMP over two classes together
(Figure 4-center), and over each one of them separately (Figure 4-bottom),
all the atoms are dedicated to the general shape, and are mostly DCTs. The
reconstructions are blurred, with no details, but keep the essential
structure of the number/image (class). They achieve the extraction of the
common internal structure. This is why simultaneous decomposition is so
important in classification. The difference between global and per-class
SOMP is not very high in this particular example since there are only two
classes. Nevertheless, it is remarkable that the per-class SOMP over the
"1"s, instead of selecting the DC atom first, selects a DCT atom that has
the shape of a vertical stroke.

The sparsity has to be kept small in order not to capture those intra-class
variations associated with details and, at the same time, the maximal number
of atoms that could be selected is $c \cdot L$. The previous arguments show
that, as a matter of fact, the number should be much smaller than this
quantity. Since multiple representation is a major concern, we train highly
under-complete dictionaries. Once those atoms (dictionary) are selected,
multiple characterizations could then be envisioned, such as correlation
with those atoms, OMP with fixed error instead of sparsity, and OMP with
fixed sparsity. Even when the set is under-complete, the problem with OMP
remains and correlation does not do better. The right descriptors are then
the coefficients of the orthogonal projection over the span of those
atoms.⁵ In fact, our criterion treats the intra-class variations as noise,
trying not to capture them in the selected subset of atoms. Orthogonal
projection follows this objective and unifies the representation.
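The descriptor described above, the coefficients of the orthogonal projection onto the span of the selected atoms, amounts to a least-squares solve. A minimal sketch, assuming NumPy and a matrix `Phi` whose columns are the selected atoms (the function name is hypothetical):

```python
import numpy as np

def projection_descriptor(Phi, x):
    """Coefficients of the orthogonal projection of x onto span(Phi).

    Phi: (n, p) matrix of selected atoms, x: (n,) signal or (n, m) signals.
    """
    coeffs, *_ = np.linalg.lstsq(Phi, x, rcond=None)
    return coeffs
```

Unlike OMP with a fixed sparsity, every signal is described with respect to the same atoms, so the coefficient vectors are directly comparable and can be fed to the classifier.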

Robustness of the Framework. In order to study the robustness of our
approach and to compare it with the fixed dictionary, we test the
classification performance, for the MNIST data, under noisy (additive
Gaussian noise) and random occlusion conditions. For each dictionary we
obtain 9 different representations (for the different types of
dimensionality reduction), with dictionary sizes of 15, 30, and 50, both for
the SOMP and SSOMP ($\theta = 1.5$ and $\theta = 2$).

⁵ There is an interesting link with the work on the Dantzig selector [2].
Under noisy conditions, this selector is used in sparse representation to
first select the atoms whose representation coefficients are not zero, and
then project the signal orthogonally over their span, as we do here. This
increases the performance of the algorithm.
Fig. 4. OMP vs SOMP vs Class-SOMP for a fixed dictionary. Left: test images.
Top-center: atoms selected by OMP for the images of zeros and ones, in the
order they have been selected. Top-right: 10-sparse reconstructions from OMP
coefficients. Center-center: atoms selected by SOMP for both of the classes
together, in the order they have been selected. Center-right:
reconstructions from 10-sparse SOMP coefficients. Bottom-center: atoms
selected by SOMP for each one of the classes separately, in the order they
have been selected. Bottom-right: reconstructions from 10-sparse Class-SOMP
coefficients.


Each scenario is tested according to the following protocol. The testing
ensemble is divided into ten blocks $\{b_i\}$, each one containing the same
number of images from each class. For each one of the blocks $b_i$ we train
one ensemble of SVMs according to a one-against-one scheme with the
coefficients of the rest of the blocks and using the parameters already
fixed in advance (the training data is noiseless). We then repeat the
following 5 times: corrupt with noise or occlude the signals in the block
$b_i$, project them orthogonally over the subset of the dictionary, and
perform classification over those coefficients (the testing data is now
corrupted). We average the accuracies of the 5 repetitions and save the
result as the test accuracy over the block. We then average the accuracy
results among the 10 executions. This procedure is a combination of 10-fold
cross-validation with averaging over the randomness of the noise and
occlusion. The SVM parameters have been previously estimated through 10-fold
cross-validation over a training set of 10000 images, different from the one
used to perform SOMP or SSOMP. As we mentioned, the representation will have
to deal with all the distortions introduced.
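The corruption step of the protocol above can be sketched for the additive Gaussian case. The text does not state how the SNR is computed, so this sketch assumes the usual definition, the ratio of the mean signal power to the noise variance expressed in dB, and uses NumPy's random generator; the function name is hypothetical and the occlusion case is not covered.

```python
import numpy as np

def add_gaussian_noise(x, snr_db, rng):
    """Return x corrupted by white Gaussian noise at the requested SNR (dB).

    Assumption: SNR = 10 * log10(mean(x**2) / noise_variance).
    """
    x = np.asarray(x, dtype=float)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
```

Drawing a fresh realization for each of the 5 repetitions, for example from a single `np.random.default_rng(seed)`, gives the averaging over the randomness of the noise that the protocol describes.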


Both the results for noisy (Figure 5-left) and occluded (Figure 5-right) images
show three major points. First of all, results are almost equivalent for all
algorithms for high dimensions (30 and 50). In those cases, most of the information
has already been captured and thus improvement is not really possible. For
complex datasets with less visual coherence, such high dimensions will capture
too much intra-variance. For a dictionary of dimension 30, the pre-defined
dictionary shows a breakdown for SNR < 10 dB, indicating that learning a dictionary
via SKSVD is more robust. When going from dimension 256 (the image dimension)
to 15 (the dictionary dimension), SKSVD derived dictionaries provide
significantly more robustness than the fixed one. For example, when decreasing
the SNR from noiseless to 5 dB, the accuracy variation is smaller than 5% for
SKSVD, whereas for the fixed dictionary it is more than 15%. This implies that
learning a dictionary provides a much more accurate description of the internal
structure of the class, capturing much of the manifold the signals belong to.

Fig. 5. Left: Classification under noisy conditions. SNR in dB on the abscissa (25 dB
is noiseless), accuracy on the ordinate. From left to right: (a) DCT and Haar, (b) KSVD
and (c) SKSVD. In each one of the cases we perform analysis at L = 15, 30, 50. The
selection is done by SOMP, SSOMP with θ = 1.5, and SSOMP with θ = 2. Right:
Classification under occlusion conditions. Size of the occlusion on the abscissa,
accuracy on the ordinate. Same order as for noisy conditions. (This is a color figure.)

Secondly, we verify the relevance of the discrimination power in the SSOMP.
For dimension 15, for example, the SSOMP over the fixed dictionary achieves
similar performance (slightly lower) to that of SKSVD in nearly noiseless
situations. However, when noise is introduced, the representation remains highly
non-robust, with the performance falling by more than 10% relative to the others
for the same distortion level. It is clear that robustness comes from adaptation
of the dictionary to the signals; learning is essential.

Thirdly,  SOMP  over  SKSVD  produces  more  accurate  classification  than  over 
KSVD,  since  the  discrimination  component  has  been  included  in  the  dictionary 
learning  itself.  When  coupled  with  SSOMP,  both  learning  strategies  yield  similar 
results.  This  implies  that  the  incorporation  of  the  discrimination  measure  is 
highly  important,  but  the  gain  in  each  of  the  stages  seems  to  be  limited. 

We have also compared learning the SKSVD dictionary and then performing
dimensionality reduction against learning it directly at the proposed
dimension. The obtained results (here omitted due to space limitations) show
not only that incorporating the per-class SOMP is critical, but also that learning
the set directly at the final dimension yields better results (with improvements
of about 2%).

These  results  have  clearly  shown  that  learning  dictionaries  under  combined 
discriminative  and  reconstructive  constraints,  as  provided  by  our  proposed  SKSVD 
framework,  is  critical  for  signal  classification,  in  particular  for  robustness  under 
common  distortions.  Let  us  close  with  an  illustration  of  the  dictionaries  learned 
by  our  proposed  technique.  Figure  6  presents  those  obtained  for  the  experiments 
in  Figure  4. 

Fig. 6. Dictionaries of size 15 obtained through SSOMP with θ = 2 over DCT and Haar
(left), KSVD (center) and SKSVD (right).

3.1  Natural  Images 


3.2  Working  with  Patches 

Following the rich literature on object recognition and scene classification, we
need to extend the present framework to work on local patches. For the digits
dataset, we have done so and obtained the same robust classification
results described in the previous section, actually improving the classification
by 5% (with a ground polynomial metric). In particular, all 8 x 8 patches are
considered, and their signatures according to our proposed framework are clustered.
Following this, digits are compared using a normalized sum-of-kernels distance
between the corresponding vector signatures, where each coordinate in the
vector is provided by the cluster center and cluster cardinality. This opens the door
to exploiting the proposed framework, for example, in the form of bags-of-words
models, replacing standard SIFT-type features by the ones explicitly learned
for classification with our method.
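
As an illustration of the first step, the following Python sketch collects every 8 x 8 patch of a digit image. It assumes a square 16 x 16 image (the text only states that the image dimension is 256) and a stride of one pixel; the function name and the placeholder image are hypothetical.

```python
import numpy as np

def extract_patches(image, size=8, stride=1):
    """Return every size x size patch of a 2-D image, one flattened patch per row."""
    h, w = image.shape
    patches = [
        image[r:r + size, c:c + size].ravel()
        for r in range(0, h - size + 1, stride)
        for c in range(0, w - size + 1, stride)
    ]
    return np.stack(patches)

# Placeholder 16 x 16 image standing in for a digit (assumed shape).
digit = np.arange(256, dtype=float).reshape(16, 16)
patches = extract_patches(digit)
print(patches.shape)  # (81, 64): 9 x 9 patch positions, 64 pixels each
```

Each row of `patches` would then be sparse-coded over the learned dictionary, and the resulting coefficient vectors clustered to form the signature of the digit.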

Let us now present preliminary results in this direction for three classes shown
in Figure 7 (top-left) from the Caltech Categories dataset. We first run a
standard key-point detector based on the Harris-Laplace approach. On the
resulting patches, we run our proposed learning technique (see Figure 7
(top-right) for examples of the learned dictionaries). The coefficients
corresponding to the sparse representation over these learned dictionaries become
the local discriminative feature descriptors for each patch (in contrast, for
example, with SIFT). Once the local descriptors have been extracted, we perform
standard K-means and obtain signatures for 40 clusters. The metric between those
signatures is established with the Gaussian extension of the Earth Mover's
Distance (EMD), where the spread parameter is fixed as the average of all the
EMDs, and the ground distance is the Euclidean one (linear kernel). We normalize
the weights so that they have total weight one, thereby representing a
probability distribution, where EMD can be interpreted in terms of the work to
transform one probability distribution into the other. Finally we perform
learning and classification with SVMs. For 100 samples from each class, the
10-fold cross-validation accuracy was 94%. For 5 classes, adding leaves and cars
rear, we obtained a classification accuracy of 89.7%. Following the success of
this simple local study, which does not depend on ad-hoc features but formally
computes the local descriptors following the energies provided by our framework,
we expect that adding global structures (as currently done in the literature for
the ad-hoc features) will prove to be further discriminant, inheriting also the
robustness of the learned local representations. Further analysis will be carried
out in this direction and results will be reported elsewhere.
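
The signature and kernel steps of this pipeline can be sketched in Python as follows. This is a minimal sketch under assumptions, not the implementation behind the reported numbers: the plain Lloyd iteration stands in for "standard K-means", the kernel form exp(-EMD / A) is an assumed reading of the "Gaussian extension", A is taken as the mean over distinct pairs, and the pairwise EMD matrix is assumed to be computed elsewhere. All function names are hypothetical.

```python
import numpy as np

def kmeans_signature(descriptors, k, n_iter=20, seed=0):
    """Cluster descriptors (one per row) and return (centers, weights).

    The weights are the cluster cardinalities normalized to sum to one,
    so the signature can be read as a probability distribution.
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[start].astype(float)

    def assign_to_centers(c):
        d2 = ((descriptors[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        assign = assign_to_centers(centers)
        for j in range(k):
            members = descriptors[assign == j]
            if len(members):  # keep the previous center if a cluster empties
                centers[j] = members.mean(axis=0)
    counts = np.bincount(assign_to_centers(centers), minlength=k).astype(float)
    return centers, counts / counts.sum()

def gaussian_emd_kernel(emd):
    """Kernel K = exp(-EMD / A) from a symmetric matrix of pairwise EMDs.

    A is the average EMD over distinct pairs (assumed reading of the text).
    """
    upper = emd[np.triu_indices(emd.shape[0], k=1)]
    return np.exp(-emd / upper.mean())
```

With `k=40`, `kmeans_signature` gives one signature per image, and the matrix returned by `gaussian_emd_kernel` can be passed to an SVM that accepts a precomputed kernel.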

Fig. 7. Examples from the classes faces, airplanes, and motorbikes from Caltech
Categories (top-left); atoms learned with our technique from key patches (top-right).

4  Concluding  Remarks 


An energy-based framework for learning dictionaries for simultaneous sparse
signal representation and robust class classification has been introduced in this
paper. This energy is minimized by a class-dependent simultaneous orthogonal
matching pursuit interleaved with an efficient dictionary update. We have
contributed to the understanding of learning sparse representations for signal
classification, and showed the relevance of learning dictionaries to achieve
accurate and robust classification. We demonstrated that performing simultaneous
decomposition per class is essential in order to extract the internal structure of
the class. The orthogonal projection over dictionaries increases robustness and
unifies the representation of signals. We further demonstrated that learned
dictionaries outperform fixed ones in classification tasks, in particular for
distorted data and compact representations. Current work is concentrated on the
topic of local patches, following the promising preliminary results described above.